Introduction

Background and dataset

The objective of this report is to determine which covariates can be used to predict whether a US county has a low or high crime rate (serious crimes per 1000 inhabitants). The dataset used is county demographic information (CDI) for 440 of the most populous counties in the US, 1990–1992. Each county record includes data on the variables listed below in table . Counties with missing data have been removed from the dataset.

CDI dataset columns
Variable Description
id identification number, 1–440
county county name
state state abbreviation
area land area (square miles)
popul estimated 1990 population
pop1834 percent of 1990 CDI population aged 18–34
pop65plus percent of 1990 CDI population aged 65 years old or older
phys number of professionally active nonfederal physicians during 1990
beds total number of beds, cribs and bassinets during 1990
crimes total number of serious crimes in 1990
higrads percent of adults (25 yrs old or older) who completed at least 12 years of school
bachelors percent of adults (25 yrs old or older) with bachelor’s degree
poors percent of 1990 CDI population with income below poverty level
unemployed percent of 1990 CDI labor force which is unemployed
percapitaincome per capita income of 1990 CDI population (dollars)
totalincome total personal income of 1990 CDI population (in millions of dollars)
region Geographic region classification used by the U.S. Bureau of the Census,
including Northeast, Midwest, South and West

In order to measure crime rate, another variable called crm1000 was added, describing the number of serious crimes per 1000 inhabitants. Using crm1000, the counties were divided into those with a high and those with a non-high crime rate: counties with a crime rate above the median of crm1000 in the dataset were categorized as having a high crime rate. This crime status was stored in another column called hicrm, which takes the value 1 if the county is a high-crime county and 0 if it is a low-crime county. In this paper, this binary variable will be used as the dependent variable. Similar to the crime rate, a variable phys1000 was also added, measuring the number of physicians per 1000 inhabitants.
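The construction of the derived columns can be sketched as follows. This is a minimal Python sketch (the analysis itself was carried out in R), assuming each county is a dict with the crimes and popul fields from the table above:

```python
def add_crime_columns(counties):
    """Add crm1000 (serious crimes per 1000 inhabitants) and hicrm
    (1 if crm1000 is above the dataset median, else 0) to each record."""
    for c in counties:
        c["crm1000"] = 1000 * c["crimes"] / c["popul"]
    rates = sorted(c["crm1000"] for c in counties)
    n = len(rates)
    median = rates[n // 2] if n % 2 else (rates[n // 2 - 1] + rates[n // 2]) / 2
    for c in counties:
        c["hicrm"] = 1 if c["crm1000"] > median else 0
    return counties
```

Note that with this strict above-the-median cut-off, counties exactly at the median are labelled low-crime.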

Model

The binary dependent variable was modelled using a logistic regression model. This model assumes that the log-odds of a certain observation \(i\) is a linear combination of its covariates \(X_{j,i}\) and parameters \(\beta_j\). As such, the model looks like: \[\begin{equation} \ln{\frac{p_i}{1 - p_i}} = \beta_0 + \sum_j\beta_{j} \cdot X_{j,i} \end{equation}\]
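Inverting the log-odds gives the success probability via the logistic function. A small Python sketch (hypothetical helper names; the models in this report were fitted in R) illustrating the mapping from linear predictor to probability:

```python
import math

def predict_prob(beta0, betas, x):
    """p = 1 / (1 + exp(-(beta0 + sum_j beta_j * x_j))),
    i.e. the inverse of the log-odds (logit) link."""
    eta = beta0 + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-eta))
```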

Analysis

The higrad model

Introduction

The first model considered has higrads as the sole covariate. As such, the model becomes:

\[\begin{equation} \ln{\frac{p_i}{1 - p_i}} = \beta_0 + \beta_{higrads} \cdot X_{higrads,i} \end{equation}\]

In order to determine whether there is a relationship between hicrm and higrads, they are plotted against each other in figure . Because hicrm is a binary variable, it is very difficult to judge whether there is a relationship from the pattern in the plot alone. In order to clarify the relationship, a kernel smoother with a bandwidth of 20 was added to the plot. In addition, the fitted model along with its 95 % confidence interval is included.

\label{fig:hicrm_higrads}Plot of \texttt{hicrm} against \texttt{higrads}, including kernel smoothing and prediction of fitted model with 95 \% confidence interval


As seen in the figure, the kernel curve looks S-shaped, implying that a logistic model may be appropriate to describe the relationship. Further, the S-shape is “downward facing”, implying a negative \(\beta_{higrads}\), i.e. that the probability of a county being classified as high-crime decreases as higrads increases. Another thing to note is how few data points exist with higrads below 60, meaning that significance is low in this region. In addition, for points with higrads above 60, the kernel curve and fitted line only cover about 25 %–75 %, implying that higrads may not predict hicrm well.

Fitted model and significance

In order to study the significance of the model, the \(\beta\) values together with their 95 % confidence intervals are presented in table .

\(\beta\)-values of the model, with 95 % confidence interval
Estimate 2.5 % 97.5 % P-value
\(\beta_0\) 3.98 1.81 6.25 0.00044
\(\beta_{higrads}\) -0.05 -0.08 -0.02 0.00041

As seen in the table, all of the P-values are < 0.05, meaning that higrads has a statistically significant effect on hicrm. The size of this effect can be measured by \(e^{\beta_{higrads}}\), showing that an increase of 1 percentage point in higrads decreases the odds of hicrm by 5 %, while an increase of 10 percentage points decreases the odds by 40.1 %.
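These percentages follow directly from exponentiating the slope. A sketch, assuming an unrounded estimate of about -0.0513 (the table reports the rounded value -0.05):

```python
import math

beta_higrads = -0.0513  # assumed unrounded estimate; the table shows -0.05

# Odds ratios for a 1-point and a 10-point increase in higrads:
or_1 = math.exp(beta_higrads)
or_10 = math.exp(10 * beta_higrads)

print(f"1-point increase:  odds change {100 * (or_1 - 1):+.1f} %")   # about -5.0 %
print(f"10-point increase: odds change {100 * (or_10 - 1):+.1f} %")  # about -40.1 %
```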

Model predictions

Using the higrads model, the probability of a high crime rate, with 95 % confidence interval, is predicted for a county where higrads is 65 (percent) and for one where it is 85 (percent). The results may be found in table .

Predictions of higrads model
Higrads Probability (%) 2.5 % 97.5 %
65 65.6 55.9 74.2
85 40.6 34.1 47.6

Model performance analysis

In order to analyze model performance, the sensitivity and specificity of the model were calculated. The sensitivity of a model is the proportion of actual positives in the dataset that are predicted as positive, while the specificity is the proportion of actual negatives that are predicted as negative. As such, the higher the sensitivity and specificity, the better.
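The two measures can be computed from the 0/1 outcomes and predictions; a minimal Python sketch:

```python
def sens_spec(y_true, y_pred):
    """Sensitivity = TP / (TP + FN), specificity = TN / (TN + FP)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)
```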

The sensitivity of the model was 55.5 % and the specificity 57.3 %, implying that the model does not classify the crime rate of the counties particularly successfully.

The region model

Introduction

Next, a logistic model was adopted based on the region covariate. Since region is not continuous but categorical, it is modelled using “dummy variables” \(X_i\). In order to implement this effectively, one of the categories is chosen as a reference category, and the effects of the other categories are measured relative to it.

In order to determine this reference category, a cross-tabulation of the data between region and hicrm is studied, see table .

Cross-tabulation between region and hicrm
Low crime High crime
Northeast 82 21
Midwest 64 44
South 44 108
West 30 47

As the reference region, the one with the largest number of counties in its smaller low/high category was chosen, with the size of the other low/high category as a tie-breaker. This approach produces the lowest standard errors, and therefore the highest significance. As seen in table , this condition results in choosing South as the reference region.

Using this reference region, the logistic model becomes \[\begin{equation} \ln{\frac{p_i}{1 - p_i}} = \beta_0 + \beta_{Northeast} \cdot X_{Northeast,i} + \beta_{Midwest} \cdot X_{Midwest,i} + \beta_{West} \cdot X_{West,i} \end{equation}\]

Here, the \(\beta\) coefficients are measured relative to South, while \(\beta_0\) is the log-odds for South.
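The dummy coding with South as reference can be sketched as follows (a hypothetical helper; in R this is handled automatically by the factor's reference level):

```python
def region_dummies(region):
    """Encode region as three 0/1 indicators, with South as the
    reference category (all three indicators zero)."""
    levels = ["Northeast", "Midwest", "West"]
    return [int(region == lvl) for lvl in levels]
```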

Fitted model and significance

The model was fit with the given data set, estimating \(\beta_i\), shown together with its 95 % confidence interval and P-value in table .

\(\beta\)-estimates for the model, together with 95 % confidence interval and P-values
Estimate 2.5 % 97.5 % P-value
\(\beta_0\) 0.90 0.56 1.26 0.000
\(\beta_{Northeast}\) -2.26 -2.87 -1.68 0.000
\(\beta_{Midwest}\) -1.27 -1.80 -0.76 0.000
\(\beta_{West}\) -0.45 -1.03 0.13 0.127

As may be seen in table , the P-value for \(\beta_{West}\) is not less than 0.05, indicating a lack of statistical significance in the difference between how South and West predict hicrm.

Next, the odds-ratios for the different categories were determined. The odds-ratio measures the odds of a particular category relative to the reference category. These may be calculated as \(OR_i = e^{\beta_i}\) and may be seen in table .

Odds-ratios for the model, together with 95 % confidence interval
OR 2.5 % 97.5 %
Northeast 0.10 0.06 0.19
Midwest 0.28 0.17 0.47
West 0.64 0.36 1.14

As seen in table , the odds-ratios are less than 1 for all categories, implying that the odds are lower in every other region than in the reference region, i.e. that the probability of a high crime rate is lower in all regions compared to South. This is consistent with table .
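The odds-ratios follow from exponentiating the \(\beta\)-estimates above; a sketch using the rounded estimates:

```python
import math

betas = {"Northeast": -2.26, "Midwest": -1.27, "West": -0.45}  # rounded estimates
odds_ratios = {region: math.exp(b) for region, b in betas.items()}
# Rounded to two decimals this reproduces the table: 0.10, 0.28, 0.64
```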

Model predictions

Using the fitted model, the probabilities of a high crime rate, with confidence intervals, for the different regions were determined, as shown in table .

Probability of high crime rate (%), together with 95 % confidence interval for each of the regions
Probability (%) 2.5 % 97.5 %
Northeast 20.4 12.6 28.2
Midwest 40.7 31.5 50.0
South 71.1 63.8 78.3
West 61.0 50.1 71.9
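These probabilities can be recovered from the \(\beta\)-estimates by inverting the logit; using the rounded coefficients from the earlier table reproduces the values above to within about 0.2 percentage points:

```python
import math

b0 = 0.90  # log-odds for the reference region, South
region_beta = {"Northeast": -2.26, "Midwest": -1.27, "South": 0.0, "West": -0.45}

def prob_high_crime(region):
    """p = 1 / (1 + exp(-(b0 + beta_region)))."""
    eta = b0 + region_beta[region]
    return 1.0 / (1.0 + math.exp(-eta))
```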

Model performance analysis

For the region model, the sensitivity was 70.5%, while the specificity was 66.4%.

Comparing the two models analyzed so far, the region model performs better as measured by sensitivity and specificity, as seen in table .

Comparison of sensitivity and specificity of the higrads and region models
Covariate Sensitivity (%) Specificity (%)
Higrads 55.5 57.3
Region 70.5 66.4

Combined model and comparison

Introduction

Next, a model that uses both higrads and region is analyzed, i.e. the following model:

\[\begin{equation} \ln{\frac{p_i}{1 - p_i}} = \beta_0 + \beta_{Northeast} \cdot X_{Northeast,i} + \beta_{Midwest} \cdot X_{Midwest,i} + \beta_{West} \cdot X_{West,i} + \beta_{higrads} \cdot X_{higrads,i} \end{equation}\]

Model comparison

In order to compare the models, metrics other than sensitivity and specificity may be studied. Some of these are AIC and BIC, which are penalized-likelihood criteria, and Nagelkerke's pseudo \(R^2\), a measurement that approaches 1 as the model fit improves. As they are defined, AIC and BIC should be as low as possible for a performant model, while the pseudo \(R^2\) should be as high as possible.
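For reference, the two penalized-likelihood criteria can be written out; a sketch of the definitions, with \(k\) parameters, \(n\) observations and maximized log-likelihood \(\ln L\):

```python
import math

def aic(log_lik, k):
    """Akaike information criterion: AIC = 2k - 2 ln L."""
    return 2 * k - 2 * log_lik

def bic(log_lik, k, n):
    """Bayesian information criterion: BIC = k ln(n) - 2 ln L.
    For n > e^2 (about 7.4) BIC penalizes each parameter harder than AIC."""
    return k * math.log(n) - 2 * log_lik
```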

A comparison between the models with regard to AIC, BIC and pseudo \(R^2\) is seen in figure , while a comparison of sensitivity and specificity is seen in table .

\label{fig:comparison_region_higrads_both}Comparison of AIC, BIC and Nagelkerke pseudo $R^2$ for the different models


Comparison of sensitivity and specificity of models
Covariate Sensitivity (%) Specificity (%)
Higrads 55.5 57.3
Region 70.5 66.4
Both 70.5 67.3

As seen in figure and table , the combined model with both covariates performs at least as well as the other models on all the studied metrics.

Combined model performance

Performance of the combined model can be analyzed by studying a QQ-plot (see figure ), the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^T\beta\) (see figure ), as well as Cook’s distance against the linear predictor, against higrads and against region (see figure ).

\label{fig:qq-plot_combined}QQ-plot for the combined model


The QQ-plot seems to indicate a bimodal distribution.

\label{fig:residuals_combined}Squared standardized Pearson residuals as well as standardized deviance residuals for the combined model, against the linear predictor $x^T\beta$


For a standardized Pearson residual to be considered suspiciously large, \(|r_i| > |\lambda_{\alpha/2}| \approx 2\). This is true for 8 of the standardized Pearson residuals, as can be seen in figures XX and XX, where the standardized Pearson residuals are plotted against the linear predictor. For a deviance residual to be considered too large, \(|r_i| > 2\). This is never the case, which is verified by figure XX, where the standardized deviance residuals are plotted against the linear predictor.
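A sketch of the residuals used here, for a single Bernoulli observation with outcome y, fitted probability p and leverage h (standardization divides by \(\sqrt{1-h}\)):

```python
import math

def std_pearson_resid(y, p, h=0.0):
    """Standardized Pearson residual: (y - p) / sqrt(p (1 - p) (1 - h))."""
    return (y - p) / math.sqrt(p * (1 - p) * (1 - h))

def std_deviance_resid(y, p, h=0.0):
    """Standardized deviance residual, carrying the sign of y - p."""
    d2 = -2 * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return math.copysign(math.sqrt(d2 / (1 - h)), y - p)
```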

In order to measure how the \(\beta\)-estimates were influenced by individual observations, Cook’s distance for logistic regression was calculated and plotted.
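For logistic regression, Cook's distance is commonly approximated from the standardized Pearson residual and the leverage; a sketch of one such approximation (k is the number of estimated parameters):

```python
def cooks_distance(r_std_pearson, leverage, k):
    """One common GLM approximation of Cook's distance:
    D = r^2 * h / (k * (1 - h)), where r is the standardized Pearson
    residual and h the leverage of the observation."""
    r, h = r_std_pearson, leverage
    return r * r * h / (k * (1 - h))
```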

\label{fig:cooks_combined}Cook's distance, for the combined model, against linear predictor, region as well as higrads


As can be seen in diagrams XX, XX and XX, the vast majority of the observations are below the horizontal line. The number of counties with a high Cook’s distance is limited to 3–4. The data entry with the largest Cook’s distance is found in the South region, while the rest of the counties with a high Cook’s distance are found in the West region. The Cook’s distance diagrams give no reason to believe that the combined model is amiss.

Interaction model

Introduction

As a fourth model, interaction terms are also considered, capturing that the effect of higrads may differ between regions. The log-odds in this model include interaction terms such as \(\beta_{Northeast*higrads} \cdot X_{Northeast,i} \cdot X_{higrads,i}\).

Model performance

The performance of the interaction model compared to the combined model may be analyzed with the likelihood ratio test. This test is similar to a partial F-test, but adapted to logistic regression, using likelihoods since sums of squares are not applicable. The likelihood ratio test indicates that the interaction model is significantly better than the combined model, with a P-value of 0.019.
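The test statistic is twice the difference in maximized log-likelihoods, compared against a \(\chi^2\) distribution with as many degrees of freedom as added parameters (here the three interaction terms). A sketch, with log-likelihoods recovered from the reported AIC values under the assumption of 5 parameters in the combined model and 8 in the interaction model:

```python
import math

def chi2_sf_df3(x):
    """Survival function of the chi-squared distribution with 3 degrees of
    freedom, via its closed form (avoids an external dependency)."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# ln L = (2k - AIC) / 2
loglik_combined = (2 * 5 - 537) / 2     # -263.5
loglik_interaction = (2 * 8 - 533) / 2  # -258.5

lr_stat = 2 * (loglik_interaction - loglik_combined)  # 10.0
p_value = chi2_sf_df3(lr_stat)                        # about 0.019
```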

In addition, the AIC, BIC, Nagelkerke pseudo \(R^2\), sensitivity and specificity are compared to those of the combined model in table . The performance of the interaction model can further be analyzed by studying a QQ-plot (see figure ), the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^T\beta\) (see figure ), as well as Cook’s distance against the linear predictor, against higrads and against region (see figure ).

Comparison of sensitivity and specificity of models
Covariate AIC BIC Sensitivity (%) Specificity (%) Pseudo R2
Combined model 537 558 70 67 0.23
Interaction model 533 566 72 68 0.25

\label{fig:qq-plot_interaction}QQ-plot for the interaction model


\label{fig:residuals_interaction}Squared standardized Pearson residuals as well as standardized deviance residuals for the interaction model, against the linear predictor $x^T\beta$


The interaction model has more Pearson residuals that can be considered suspiciously large than the combined model. Furthermore, it has standardized deviance residuals that are too large. This calls the credibility of the interaction model into question.

\label{fig:cooks_interaction}Cook's distance, for the interaction model, against linear predictor, region as well as higrads



Both the interaction and the combined model perform better on some metrics, while performing worse on others. The interaction model has a lower AIC value but at the same time a higher BIC value. This is not too surprising, since BIC penalizes larger, more complex models more than AIC tends to do. The sensitivity, specificity and pseudo \(R^2\) values are worse for the combined model. The interaction model performs considerably worse than the combined model with regard to Cook’s distance: it has far more outliers, and outliers with a much higher Cook’s distance.

As mentioned, the interaction model has some credibility issues with its residuals. Moreover, it has far more outliers, which also have markedly larger Cook’s distances than in the combined model. At the same time, the interaction model outperforms the combined model on the likelihood ratio test, AIC, sensitivity, specificity and pseudo \(R^2\). The outliers in the interaction model might therefore not affect the model too much. In the end, the interaction model was favoured over the combined model.

Finding the optimal model

Methodology

Next, an attempt is made to fit an optimal model for predicting high crime rates, using the previous covariates as well as poors and phys1000. When searching for the optimal model, interaction terms were disregarded.

Models of increasing complexity, adding more and more covariates, are compared to each other on the metrics used so far, i.e. AIC, BIC, pseudo \(R^2\), sensitivity and specificity. In addition, the result of automatic selection using R’s step function is studied.

Model comparison

The AIC, BIC and pseudo \(R^2\) of the studied models are shown in figure . In addition, table includes the sensitivity and specificity of the different models.

\label{fig:comparison_optimal}Comparison of AIC, BIC and Nagelkerke pseudo $R^2$ for the different models. Key: H = \texttt{higrads}, R = \texttt{region}, Po = \texttt{poors}, Phy = \texttt{phys1000}


Comparison of sensitivity and specificity of models. Key: H = higrads, R = region, Po = poors, Phy = phys1000
Model AIC BIC Sensitivity (%) Specificity (%) Pseudo R2
H 601 609 55 57 0.04
H + R 537 558 70 67 0.23
H + R + Po 494 518 73 75 0.34
H + R + Po + Phy 481 509 74 75 0.37
R + Po + Phy 480 504 75 74 0.37
H + R + Phy 508 532 72 72 0.30

The results in the figure and the table show that the region + poors + phys1000 model performs best on most of the metrics. This result is also consistent with the output of the step algorithm. As such, this model is considered the optimal model for this problem.

Model performance

Performance of the optimal model is then analyzed by studying a QQ-plot (see figure ), the squared standardized Pearson residuals and the standardized deviance residuals against the linear predictor \(x^T\beta\) (see figure ), as well as Cook’s distance against the linear predictor, against higrads and against region (see figure ).

\label{fig:qq-plot_optimal}QQ-plot for the optimal model


The QQ-plot of the optimal model seems to follow a roughly normal, left-skewed distribution.

\label{fig:residuals_optimal}Squared standardized Pearson residuals as well as standardized deviance residuals for the optimal model, against the linear predictor $x^T\beta$


There is one particular outlier that seems to generate an exceptionally large Pearson residual. Furthermore, there are a few Pearson residuals with suspiciously large values. As for the deviance residuals, some of them are too large as well. Disregarding the most extreme outlier, the residuals are not far worse than those of the interaction model. The most extreme outlier is Olmsted county. In the next section, the potential influence of individual observations is addressed by plotting Cook’s distance.

\label{fig:cooks_optimal}Cook's distance, for the optimal model, against linear predictor, region as well as higrads


The outlier is Olmsted county; see figure for a plot of Cook’s distance excluding this point. This is the aforementioned data entry that resulted in the extreme residuals. Table compares the performance of these models.

\label{fig:cooks_optimal_corrected}Cook's distance, for the optimal model, against linear predictor, region as well as higrads, excluding outlier Olmsted


Removing the problematic data entry from the CDI data frame and refitting the optimal model resulted in a substantial improvement in the residual plots, see figures XX and XX, and in the Cook’s distance plots, see figures XX and XX. The Cook’s distance and residual plots without the problematic data entry are relatively similar to figures XX and XX; the main difference is the two largest outliers in figures XX and XX.

Comparison of sensitivity and specificity of optimal model v.s. optimal model with outlier Olmsted removed
Model AIC BIC Sensitivity (%) Specificity (%) Pseudo R2
Optimal 480 504 75 74 0.37
Optimal without outlier 466 491 74 75 0.39

The model with the outlier removed performs better on most of the metrics.

Discussion

In order to first get an overview of the different covariates and how they relate to each other, they are plotted against each other in figure .

\label{fig:pairs}Plot of covariates against each other


The optimal model includes the previously studied covariate region but discards the higrads covariate. In addition, it includes the new poors and phys1000 covariates. One explanation why higrads is not used in the optimal model may be seen in figure , where there is a high correlation between higrads and poors. With increased multicollinearity, the standard errors of the coefficients increase; in the worst case, this can cause some variables to lose significance. Problems such as these are not unlikely when involving both the region and higrads covariates, which have a correlation of -0.692 between them. This is probably one of the reasons why the optimal model performs better than the interaction model in the previous section, which had both higrads and region as covariates.

This hypothesis is tested in figure , where a linear regression model has been fit. Studying the P-values of the model reveals that the \(\beta\)-values are highly significant.

\label{fig:poors_higrads}Plot of \texttt{higrads} against \texttt{poors}, together with linear regression line, with 95 % confidence and prediction intervals


How well poors predicts hicrm may be seen in figure . Here it may be seen that poors follows a more distinct S-shape, and that poors seems to split the dataset more distinctly between high and non-high crime rates, as it varies from approximately 15 % to 80 %, rather than the low separation discussed previously. As such, it seems that higrads and poors are highly correlated, but that poors predicts hicrm better and is therefore the one left in the model.

\label{fig:hicrm_poors}Plot of \texttt{hicrm} against \texttt{poors}, including kernel smoothing and prediction of fitted model with 95 \% confidence interval


Regarding phys1000, it appears in figure that it does not have as clear a relationship to the other covariates and therefore provides more information to the model. Looking at how well phys1000 predicts hicrm, as seen in figure , it seems to follow an approximate S-shape and therefore contributes to the model.

\label{fig:hicrm_phys1000}Plot of \texttt{hicrm} against \texttt{phys1000}, including kernel smoothing and prediction of fitted model with 95 \% confidence interval


To summarize, the optimal model ignores the higrads covariate and uses region, poors and phys1000 as covariates.